Abstract


This paper provides a theoretical analysis of domain adaptation based on the PAC-Bayesian
theory. We improve the previous domain adaptation bound obtained by Germain et al. [1] in
two ways. First, we give another generalization bound that is tighter and easier to interpret.
Second, we provide a new analysis of the constant term appearing in the bound, which can be
of high interest for developing new algorithmic solutions.

1 Introduction 

Domain adaptation (DA) arises when the distribution generating the target data differs from the one
that generated the source learning data. Classical theoretical analyses of domain
adaptation propose generalization bounds over the expected risk of a classifier belonging to a
hypothesis class $\mathcal{H}$ over the target domain [2, 3, 4]. Recently, Germain et al. have given a
generalization bound expressed as an averaging over the classifiers in $\mathcal{H}$ using the PAC-Bayesian theory [1].
In this paper, we derive a new PAC-Bayesian domain adaptation bound that improves the previous
result of [1]. Moreover, we provide an analysis of the constant term appearing in the bound, opening
the door to the design of new algorithms able to control this term. The paper is organized as follows. We
introduce the classical PAC-Bayesian theory in Section 2. We present the domain adaptation bound
obtained in [1] in Section 3. Section 4 presents our new results.

2 PAC-Bayesian Setting in Supervised Learning 

In the non-adaptive setting, the PAC-Bayesian theory [5] offers generalization bounds (and
algorithms) for weighted majority votes over a set of functions, called voters. Let $X \subseteq \mathbb{R}^d$ be the input
space of dimension $d$ and $Y = \{-1, +1\}$ be the output space. A domain $P_s$ is an unknown
distribution over $X \times Y$. The marginal distribution of $P_s$ over $X$ is denoted by $D_s$. Let $\mathcal{H}$ be a set
of $n$ voters such that $\forall h \in \mathcal{H},\ h : X \to Y$, and let $\pi$ be a prior on $\mathcal{H}$. A prior is a probability
distribution on $\mathcal{H}$ that "models" some a priori knowledge on the quality of the voters of $\mathcal{H}$.

Then, given a learning sample $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m_s}$, drawn independently and identically distributed
(i.i.d.) according to the distribution $P_s$, the aim of the PAC-Bayesian learner is to find a posterior
distribution $\rho$ leading to a $\rho$-weighted majority vote $B_\rho$ over $\mathcal{H}$ that has the lowest possible expected
risk, i.e., the lowest probability of making an error on future examples drawn from $D_s$. More
precisely, the vote $B_\rho$ and its true and empirical risks are defined as follows.
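Definition 1. Let $\mathcal{H}$ be a set of voters. Let $\rho$ be a distribution over $\mathcal{H}$. The $\rho$-weighted majority
vote $B_\rho$ (sometimes called the Bayes classifier) is:

$$\forall \mathbf{x} \in X, \quad B_\rho(\mathbf{x}) = \mathrm{sign}\Big[ \mathbf{E}_{h \sim \rho}\, h(\mathbf{x}) \Big].$$

The true risk of $B_\rho$ on a domain $P_s$ and its empirical risk on an $m_s$-sample $S$ are respectively:

$$R_{P_s}(B_\rho) = \mathbf{E}_{(\mathbf{x}_i, y_i) \sim P_s} \mathbf{I}[B_\rho(\mathbf{x}_i) \ne y_i], \quad \text{and} \quad R_S(B_\rho) = \frac{1}{m_s} \sum_{i=1}^{m_s} \mathbf{I}[B_\rho(\mathbf{x}_i) \ne y_i],$$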

where $\mathbf{I}[a \ne b]$ is the 0-1 loss function returning 1 if $a \ne b$ and 0 otherwise. Usual PAC-Bayesian
analyses [5, 6, 7, 8, 9] do not directly focus on the risk of $B_\rho$, but bound the risk of the closely related
stochastic Gibbs classifier $G_\rho$. It predicts the label of an example $\mathbf{x}$ by first drawing a classifier $h$
from $\mathcal{H}$ according to $\rho$, and then returning $h(\mathbf{x})$. Thus, the true risk of $G_\rho$ and its empirical risk on an
$m_s$-sample $S$ correspond to the expectation of the risks over $\mathcal{H}$ according to $\rho$:


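$$R_{P_s}(G_\rho) = \mathbf{E}_{h \sim \rho} R_{P_s}(h) = \mathbf{E}_{(\mathbf{x}_i, y_i) \sim P_s}\, \mathbf{E}_{h \sim \rho}\, \mathbf{I}[h(\mathbf{x}_i) \ne y_i],$$

$$\text{and} \quad R_S(G_\rho) = \mathbf{E}_{h \sim \rho} R_S(h) = \frac{1}{m_s} \sum_{i=1}^{m_s} \mathbf{E}_{h \sim \rho}\, \mathbf{I}[h(\mathbf{x}_i) \ne y_i].$$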


Note that it is well known in the PAC-Bayesian literature that the risk of the deterministic
classifier $B_\rho$ and the risk of the stochastic classifier $G_\rho$ are related by $R_{P_s}(B_\rho) \le 2 R_{P_s}(G_\rho)$.
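
As an illustration of these definitions, the following minimal sketch computes the empirical risks of the majority vote and of the Gibbs classifier on a toy sample. The decision stumps, the weights, the sample, and the tie-breaking rule of `sign` are all hypothetical choices made for the example; they do not come from the paper.

```python
def sign(value):
    # Assumption: ties (value == 0) are broken in favour of +1.
    return 1 if value >= 0 else -1

def majority_vote(voters, rho, x):
    """B_rho(x): sign of the rho-weighted average of the voters' outputs."""
    return sign(sum(w * h(x) for h, w in zip(voters, rho)))

def risk_majority_vote(voters, rho, sample):
    """Empirical risk R_S(B_rho): fraction of examples the vote misclassifies."""
    return sum(majority_vote(voters, rho, x) != y for x, y in sample) / len(sample)

def risk_gibbs(voters, rho, sample):
    """Empirical risk R_S(G_rho): rho-average of the individual empirical risks."""
    return sum(
        w * sum(h(x) != y for x, y in sample) / len(sample)
        for h, w in zip(voters, rho)
    )

# Toy setting: three decision stumps on the real line (illustrative only).
voters = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 1 else -1,
    lambda x: 1 if x > -1 else -1,
]
rho = [0.5, 0.25, 0.25]
sample = [(-2.0, -1), (-0.5, -1), (0.5, 1), (1.5, 1), (0.2, -1)]

r_b = risk_majority_vote(voters, rho, sample)
r_g = risk_gibbs(voters, rho, sample)
print(r_b, r_g)
```

On this sample the vote errs on one example out of five while the Gibbs risk is the weighted average of the stumps' error rates, so the relation between the two risks can be checked directly.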


3 PAC-Bayesian Domain Adaptation of the Gibbs classifier 


Throughout the rest of this paper, we consider the PAC-Bayesian DA setting introduced by Germain
et al. [1]. The main difference between supervised learning and DA is that we have two different
domains over $X \times Y$: the source domain $P_s$ and the target domain $P_t$ ($D_s$ and $D_t$ are the respective
marginals over $X$). The aim is then to learn a good model on the target domain $P_t$ knowing that we
only have label information from the source domain $P_s$. Concretely, in the setting described in [1],
we have a labeled source sample $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m_s}$, drawn i.i.d. from $P_s$, and an unlabeled target
sample $T = \{\mathbf{x}_j\}_{j=1}^{m_t}$, drawn i.i.d. from $D_t$. One thus desires to learn from $S$ and $T$ a weighted
majority vote with the lowest possible expected risk on the target domain $R_{P_t}(B_\rho)$, i.e., with good
generalization guarantees on $P_t$. Recalling that usual PAC-Bayesian generalization bounds study the
risk of the Gibbs classifier, Germain et al. [1] have analyzed its target risk $R_{P_t}(G_\rho)$, which
also relies on the notion of disagreement between the voters:

$$R_D(h, h') = \mathbf{E}_{\mathbf{x} \sim D}\, \mathbf{I}[h(\mathbf{x}) \ne h'(\mathbf{x})]. \tag{1}$$

Their main result is the following theorem. 

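Theorem 1 (Theorem 4 of [1]). Let $\mathcal{H}$ be a set of voters. For every distribution $\rho$ over $\mathcal{H}$, we have:

$$R_{P_t}(G_\rho) \le R_{P_s}(G_\rho) + \mathrm{dis}_\rho(D_s, D_t) + \lambda_{\rho, \rho_T^*}, \tag{2}$$

where $\mathrm{dis}_\rho(D_s, D_t)$ is the domain disagreement between the marginals $D_s$ and $D_t$,

$$\mathrm{dis}_\rho(D_s, D_t) = \Big| \mathbf{E}_{(h, h') \sim \rho^2} \big( R_{D_s}(h, h') - R_{D_t}(h, h') \big) \Big|, \tag{3}$$

with $\rho^2(h, h') = \rho(h) \times \rho(h')$, and $\lambda_{\rho, \rho_T^*} = R_{P_t}(G_{\rho_T^*}) + R_{D_t}(G_\rho, G_{\rho_T^*}) + R_{D_s}(G_\rho, G_{\rho_T^*})$,
where $\rho_T^* = \mathrm{argmin}_\rho\, R_{P_t}(G_\rho)$ is the best distribution on the target domain.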
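
The domain disagreement of Eq. (3) only involves unlabeled points, so its empirical counterpart can be computed from the inputs of $S$ and $T$. The sketch below is one straightforward way to do so for a finite set of voters; the function names, the two stumps, and the sample points are hypothetical and serve only as an illustration.

```python
def disagreement(h1, h2, points):
    """Empirical R_D(h1, h2): fraction of points on which the two voters differ."""
    return sum(h1(x) != h2(x) for x in points) / len(points)

def domain_disagreement(voters, rho, source_points, target_points):
    """Empirical dis_rho: |E_{(h,h') ~ rho^2} (R_S(h,h') - R_T(h,h'))|."""
    total = 0.0
    for h1, w1 in zip(voters, rho):
        for h2, w2 in zip(voters, rho):
            gap = (disagreement(h1, h2, source_points)
                   - disagreement(h1, h2, target_points))
            total += w1 * w2 * gap  # rho^2(h, h') = rho(h) * rho(h')
    return abs(total)

# Two stumps that disagree only on the interval (0, 1] (illustrative only).
voters = [lambda x: 1 if x > 0 else -1, lambda x: 1 if x > 1 else -1]
rho = [0.5, 0.5]
source_points = [-1.0, 0.5, 2.0, 3.0]   # one point out of four in (0, 1]
target_points = [0.2, 0.5, 0.8, 2.0]    # three points out of four in (0, 1]

dis = domain_disagreement(voters, rho, source_points, target_points)
print(dis)
```

The value is zero whenever the two samples induce the same pairwise disagreements, and it grows as the target sample concentrates in regions where the voters with large weight disagree.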


Note that this bound reflects the usual philosophy in DA: it is well known that a favorable situation
for DA arises when the divergence between the domains is small while good source
performance is achieved [2, 3, 4]. Germain et al. [1] have then derived a first promising algorithm, called PBDA, for
minimizing this trade-off between source risk and domain disagreement.

Note that Germain et al. [1] also showed that, for a given hypothesis class $\mathcal{H}$, the domain
disagreement of Equation (3) is always smaller than the $\mathcal{H}\Delta\mathcal{H}$-distance of Ben-David et al. [2, 3] defined by

$$\tfrac{1}{2} \sup_{(h, h') \in \mathcal{H}^2} \big| R_{D_t}(h, h') - R_{D_s}(h, h') \big|.$$
4 New Results


4.1 Improvement of Theorem 1 


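First, we introduce the notion of expected joint error of a pair of classifiers $(h, h')$ drawn according
to the distribution $\rho$, defined as

$$e_{P_s}(G_\rho, G_\rho) = \mathbf{E}_{(h, h') \sim \rho^2}\, \mathbf{E}_{(\mathbf{x}, y) \sim P_s}\, \mathbf{I}[h(\mathbf{x}) \ne y] \times \mathbf{I}[h'(\mathbf{x}) \ne y]. \tag{4}$$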


Theorem 2 below relies on the domain disagreement of Eq. (3) and on the expected joint error of Eq. (4).
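Theorem 2. Let $\mathcal{H}$ be a hypothesis class. We have

$$\forall \rho \text{ on } \mathcal{H}, \quad R_{P_t}(G_\rho) \le R_{P_s}(G_\rho) + \tfrac{1}{2}\, \mathrm{dis}_\rho(D_s, D_t) + \lambda_\rho, \tag{5}$$

where $\lambda_\rho$ is the deviation between the expected joint errors of $G_\rho$ on the target and source domains:

$$\lambda_\rho \overset{\mathrm{def}}{=} \big| e_{P_t}(G_\rho, G_\rho) - e_{P_s}(G_\rho, G_\rho) \big|. \tag{6}$$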


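Proof. First, note that for any distribution $P$ on $X \times Y$, with marginal distribution $D$ on $X$, we have

$$R_P(G_\rho) = \tfrac{1}{2} R_D(G_\rho, G_\rho) + e_P(G_\rho, G_\rho),$$

as

$$\begin{aligned}
2 R_P(G_\rho) &= \mathbf{E}_{(h, h') \sim \rho^2}\, \mathbf{E}_{(\mathbf{x}, y) \sim P} \big( \mathbf{I}[h(\mathbf{x}) \ne y] + \mathbf{I}[h'(\mathbf{x}) \ne y] \big) \\
&= \mathbf{E}_{(h, h') \sim \rho^2}\, \mathbf{E}_{(\mathbf{x}, y) \sim P} \big( 1 \times \mathbf{I}[h(\mathbf{x}) \ne h'(\mathbf{x})] + 2 \times \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] \big) \\
&= R_D(G_\rho, G_\rho) + 2 \times e_P(G_\rho, G_\rho).
\end{aligned}$$

Therefore,

$$\begin{aligned}
R_{P_t}(G_\rho) - R_{P_s}(G_\rho) &= \tfrac{1}{2} \big( R_{D_t}(G_\rho, G_\rho) - R_{D_s}(G_\rho, G_\rho) \big) + \big( e_{P_t}(G_\rho, G_\rho) - e_{P_s}(G_\rho, G_\rho) \big) \\
&\le \tfrac{1}{2} \big| R_{D_t}(G_\rho, G_\rho) - R_{D_s}(G_\rho, G_\rho) \big| + \big| e_{P_t}(G_\rho, G_\rho) - e_{P_s}(G_\rho, G_\rho) \big| \\
&= \tfrac{1}{2}\, \mathrm{dis}_\rho(D_s, D_t) + \lambda_\rho.
\end{aligned}$$

□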
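
The decomposition used in this proof can be checked numerically on finite distributions. The sketch below does so for two hypothetical domains over four labeled points and two illustrative stumps; none of these values come from the paper, and the helper names are assumptions introduced for the example.

```python
from itertools import product

def gibbs_risk(voters, rho, P):
    """R_P(G_rho) for a finite domain P given as {(x, y): probability}."""
    return sum(w * p * (h(x) != y)
               for h, w in zip(voters, rho)
               for (x, y), p in P.items())

def pair_expectation(voters, rho, P, event):
    """E_{(h,h') ~ rho^2} E_{(x,y) ~ P} of a 0/1 event on (h(x), h'(x), y)."""
    weighted = list(zip(voters, rho))
    return sum(w1 * w2 * p * event(h1(x), h2(x), y)
               for (h1, w1), (h2, w2) in product(weighted, weighted)
               for (x, y), p in P.items())

def expected_disagreement(voters, rho, P):
    """R_D(G_rho, G_rho): the two drawn voters predict different labels."""
    return pair_expectation(voters, rho, P, lambda a, b, y: a != b)

def joint_error(voters, rho, P):
    """e_P(G_rho, G_rho): the two drawn voters are both wrong."""
    return pair_expectation(voters, rho, P, lambda a, b, y: a != y and b != y)

voters = [lambda x: 1 if x > 0 else -1, lambda x: 1 if x > 1 else -1]
rho = [0.7, 0.3]
P_s = {(-1.0, -1): 0.4, (0.5, 1): 0.3, (0.5, -1): 0.1, (2.0, 1): 0.2}
P_t = {(-1.0, -1): 0.1, (0.5, 1): 0.2, (0.5, -1): 0.4, (2.0, 1): 0.3}

# Identity R_P(G) = 1/2 R_D(G, G) + e_P(G, G), evaluated on the source domain.
identity_gap = gibbs_risk(voters, rho, P_s) - (
    0.5 * expected_disagreement(voters, rho, P_s) + joint_error(voters, rho, P_s))

# Right-hand side of Theorem 2 with exact (population) quantities.
dis = abs(expected_disagreement(voters, rho, P_s)
          - expected_disagreement(voters, rho, P_t))
lam = abs(joint_error(voters, rho, P_t) - joint_error(voters, rho, P_s))
bound = gibbs_risk(voters, rho, P_s) + 0.5 * dis + lam
print(identity_gap, gibbs_risk(voters, rho, P_t), bound)
```

In this toy case both differences in the proof have the same sign, so the bound of Eq. (5) is attained up to floating-point rounding.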


The improvement of Theorem 2 over Theorem 1 relies on two main points. On the one hand, our
new result contains only half of $\mathrm{dis}_\rho(D_s, D_t)$. On the other hand, contrary to $\lambda_{\rho, \rho_T^*}$ of Eq. (2),
the term $\lambda_\rho$ of Eq. (5) no longer depends on the best distribution $\rho_T^*$ on the target domain. This implies
that our new bound does not degenerate when the two distributions $P_s$ and $P_t$ are equal (or very close):
when $P_s = P_t$, both $\mathrm{dis}_\rho(D_s, D_t)$ and $\lambda_\rho$ vanish and Eq. (5) holds with equality.
In contrast, when $P_s = P_t$, the bound of Theorem 1 gives

$$R_{P_t}(G_\rho) \le R_{P_t}(G_\rho) + R_{P_t}(G_{\rho_T^*}) + 2 R_{D_t}(G_\rho, G_{\rho_T^*}),$$

which is at least $2 R_{P_t}(G_{\rho_T^*})$. Moreover, the term $2 R_{D_t}(G_\rho, G_{\rho_T^*})$ is greater than zero for any $\rho$
when the support of $\rho$ and $\rho_T^*$ in $\mathcal{H}$ is constituted of at least two different classifiers.


4.2 A New PAC-Bayesian Bound 

Note that the improvements introduced by Theorem 2 do not change the form and the philosophy of
the PAC-Bayesian theorems previously presented by Germain et al. [1]. Indeed, following the same
proof technique, we obtain the following PAC-Bayesian domain adaptation bound.

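Theorem 3. For any domains $P_s$ and $P_t$ (resp. with marginals $D_s$ and $D_t$) over $X \times Y$, any set of
hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0, 1]$, any real numbers $\alpha > 0$ and $c > 0$,
with a probability at least $1 - \delta$ over the choice of $S \times T \sim (P_s \times D_t)^m$, for every posterior
distribution $\rho$ on $\mathcal{H}$, we have

$$R_{P_t}(G_\rho) \le c'\, R_S(G_\rho) + \alpha'\, \tfrac{1}{2}\, \mathrm{dis}_\rho(S, T) + \left( \frac{c'}{c} + \frac{\alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{3}{\delta}}{m} + \lambda_\rho + \tfrac{1}{2}(\alpha' - 1),$$

where $\lambda_\rho$ is defined by Eq. (6), and where $c' = \dfrac{c}{1 - e^{-c}}$ and $\alpha' = \dfrac{2\alpha}{1 - e^{-2\alpha}}$.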
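
To give a feel for the terms of Theorem 3 that can be computed from data, the sketch below evaluates them for given empirical quantities. It assumes the constants read $c' = c / (1 - e^{-c})$ and $\alpha' = 2\alpha / (1 - e^{-2\alpha})$, as in the bound of [1], and it leaves out $\lambda_\rho$, which cannot be observed without target labels. The function names and the numerical inputs are hypothetical.

```python
import math

def c_prime(c):
    """c' = c / (1 - exp(-c)); expm1 keeps the ratio accurate for small c."""
    return c / -math.expm1(-c)

def alpha_prime(alpha):
    """alpha' = 2 alpha / (1 - exp(-2 alpha))."""
    return 2 * alpha / -math.expm1(-2 * alpha)

def kl_divergence(rho, pi):
    """KL(rho || pi) for distributions over a finite set of voters."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def computable_part(emp_risk, emp_dis, kl, m, delta, c, alpha):
    """Right-hand side of Theorem 3 without the unobservable term lambda_rho."""
    cp, ap = c_prime(c), alpha_prime(alpha)
    complexity = (cp / c + ap / alpha) * (kl + math.log(3 / delta)) / m
    return cp * emp_risk + ap * 0.5 * emp_dis + complexity + 0.5 * (ap - 1)

# Illustrative inputs only: a uniform prior over three voters.
rho = [0.5, 0.25, 0.25]
pi = [1 / 3, 1 / 3, 1 / 3]
kl = kl_divergence(rho, pi)
small_sample = computable_part(0.25, 0.1, kl, m=100, delta=0.05, c=1.0, alpha=0.5)
large_sample = computable_part(0.25, 0.1, kl, m=10000, delta=0.05, c=1.0, alpha=0.5)
print(small_sample, large_sample)
```

Both constants exceed 1 and tend to 1 as $c$ and $\alpha$ go to 0, at the price of a larger complexity term, which is the trade-off these two parameters express.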
4.3 On the Estimation of the Unknown Term $\lambda_\rho$

The next proposition gives an upper bound on the term $\lambda_\rho$ of Theorems 2 and 3.

Proposition 4. Let $\mathcal{H}$ be the hypothesis space. If we suppose that $P_s$ and $P_t$ share the same support,
then

$$\forall \rho \text{ on } \mathcal{H}, \quad \lambda_\rho \le \sqrt{\chi^2(P_t \| P_s)\; e_{P_s}(G_\rho, G_\rho)},$$

where $e_{P_s}(G_\rho, G_\rho)$ is the expected joint error on the source distribution, as defined by Eq. (4), and
$\chi^2(P_t \| P_s)$ is the chi-squared divergence between the target and the source distributions.


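Proof. Supposing that $P_t$ and $P_s$ have the same support, we can upper bound $\lambda_\rho$ as follows, using the
Cauchy-Schwarz inequality to obtain line 4 from line 3:

$$\begin{aligned}
\lambda_\rho &= \left| \mathbf{E}_{(h, h') \sim \rho^2} \left[ \mathbf{E}_{(\mathbf{x}, y) \sim P_t} \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] - \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] \right] \right| \\
&= \left| \mathbf{E}_{(h, h') \sim \rho^2} \left[ \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \frac{P_t(\mathbf{x}, y)}{P_s(\mathbf{x}, y)}\, \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] - \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] \right] \right| \\
&= \left| \mathbf{E}_{(h, h') \sim \rho^2}\, \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \left( \frac{P_t(\mathbf{x}, y)}{P_s(\mathbf{x}, y)} - 1 \right) \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] \right| \\
&\le \sqrt{ \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \left( \frac{P_t(\mathbf{x}, y)}{P_s(\mathbf{x}, y)} - 1 \right)^2 } \times \sqrt{ \mathbf{E}_{(h, h') \sim \rho^2}\, \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \big( \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] \big)^2 } \\
&\le \sqrt{ \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \left( \frac{P_t(\mathbf{x}, y)}{P_s(\mathbf{x}, y)} - 1 \right)^2 } \times \sqrt{ \mathbf{E}_{(h, h') \sim \rho^2}\, \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \mathbf{I}[h(\mathbf{x}) \ne y]\, \mathbf{I}[h'(\mathbf{x}) \ne y] } \\
&= \sqrt{ \mathbf{E}_{(\mathbf{x}, y) \sim P_s} \left( \frac{P_t(\mathbf{x}, y)}{P_s(\mathbf{x}, y)} - 1 \right)^2 \times e_{P_s}(G_\rho, G_\rho) } = \sqrt{ \chi^2(P_t \| P_s)\; e_{P_s}(G_\rho, G_\rho) }.
\end{aligned}$$

□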
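
Proposition 4 can likewise be illustrated on finite distributions that share the same support. The sketch below computes $\lambda_\rho$, the chi-squared divergence, and the source joint error for two hypothetical domains and compares both sides of the inequality; the distributions, voters, and helper names are assumptions made for the example.

```python
import math
from itertools import product

def joint_error(voters, rho, P):
    """e_P(G_rho, G_rho) for a finite domain P given as {(x, y): probability}."""
    weighted = list(zip(voters, rho))
    return sum(w1 * w2 * p * (h1(x) != y and h2(x) != y)
               for (h1, w1), (h2, w2) in product(weighted, weighted)
               for (x, y), p in P.items())

def chi_squared(P_t, P_s):
    """chi^2(P_t || P_s) = E_{z ~ P_s} (P_t(z) / P_s(z) - 1)^2, same support assumed."""
    return sum(p_s * (P_t[z] / p_s - 1) ** 2 for z, p_s in P_s.items())

voters = [lambda x: 1 if x > 0 else -1, lambda x: 1 if x > 1 else -1]
rho = [0.7, 0.3]
P_s = {(-1.0, -1): 0.4, (0.5, 1): 0.3, (0.5, -1): 0.1, (2.0, 1): 0.2}
P_t = {(-1.0, -1): 0.1, (0.5, 1): 0.2, (0.5, -1): 0.4, (2.0, 1): 0.3}

e_s = joint_error(voters, rho, P_s)
lam = abs(joint_error(voters, rho, P_t) - e_s)
upper = math.sqrt(chi_squared(P_t, P_s) * e_s)
print(lam, upper)
```

The upper bound only needs labels from the source domain through `e_s`; the divergence is the part that would have to be estimated or treated as a tunable constant in practice.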


This result indicates that $\lambda_\rho$ can be controlled by the term $e_{P_s}(G_\rho, G_\rho)$, which can be estimated from samples,
and by the chi-squared divergence between the two distributions, which we could try to estimate in an
unsupervised way or, maybe more appropriately, use as a constant to tune, expressing a tradeoff
between the two distributions. This opens the door to deriving new learning algorithms for domain
adaptation, with the hope of controlling part of the negative transfer.